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Abstract 

In many fields of science, generalized likelihood ratio tests are established tools 
for statistical inference. At the same time, it has become increasingly common that 
a simulator (or generative model) is used to describe complex processes that tie pa¬ 
rameters d of an underlying theory and measurement apparatus to high-dimensional 
observations x G W. However, simulator often do not provide a way to evaluate 
the likelihood function for a given observation x, which motivates a new class of 
likelihood-free inference algorithms. In this paper, we show that likelihood ratios are 
invariant under a specific class of dimensionality reduction maps MP i—)• M. As a di¬ 
rect consequence, we show that discriminative classihers can be used to approximate 
the generalized likelihood ratio statistic when only a generative model for the data 
is available. This leads to a new machine learning-based approach to likelihood-free 
inference that is complementary to Approximate Bayesian Computation, and which 
does not require a prior on the model parameters. Experimental results on artih- 
cial problems with known exact likelihoods illustrate the potential of the proposed 
method. 

Keywords: likelihood ratio, likelihood-free inference, classihcation, particle physics, surro¬ 
gate model 
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1 Introduction 


The likelihood function is the central object that summarizes the information from an ex¬ 
periment needed for inference of model parameters. It is key to many areas of science that 
report the results of classical hypothesis tests or confidence intervals using the (generalized 
or prohle) likelihood ratio as a test statistic. At the same time, with the advance of comput¬ 
ing technology, it has become increasingly common that a simulator (or generative model) 
is used to describe complex processes that tie parameters 6 of an underlying theory and 
measurement apparatus to high-dimensional observations x. However, directly evaluating 
the likelihood function in these cases is often impossible or is computationally impractical. 

The main result of this paper is to show that the likelihood ratio is invariant under 
dimensionality reductions MP i—)■ M, under the assumption that the corresponding transfor¬ 
mation is itself monotonic with the likelihood ratio. As a direct consequence, we derive 
and propose an alternative machine learning-based approach for likelihood-free inference 
that can also be used in a classical (frequentist) setting where a prior over the model pa¬ 
rameters is not available. More specifically, we demonstrate that discriminative classifiers 
can be used to construct equivalent generalized likelihood ratio test statistics when only a 
generative model for the data is available for training and calibration. 

As a concrete example, let us consider searches for new particles at the Large Hadron 
Collider (LHC). The simulator that is sampling from p(x|0) is based on quantum field 
theory, a detailed simulation of the particle detector, and data processing algorithms that 


transform raw sensor data into the feature vector x (Sjostrand et ah, 2006 Agostinelli 


et ah, 2003). The ATLAS and CMS experiments have published hundreds of papers where 


the final result was formulated as a hypothesis test or confidence interval using a general¬ 


ized likelihood ratio test (Cowan et ah, 2010), including most notably the discovery of the 
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Higgs boson (The ATLAS Collaboration, 2012 The CMS Collaboration, 2012) and sub¬ 
sequent measurement of its properties. The bulk of the likelihood ratio tests at the LHC 
are based on the distribution of a single event-level feature that discriminates between a 
hypothesized process of interest (labeled signal) and various other processes (labeled back¬ 
ground). Typically, data generated from the simulator are used to approximate the density 
at various parameter points, and an interpolation algorithm is used to approximate the 


parameterized model (Cranmer et ah, 2012). In order to improve the statistical power of 
these tests, hundreds of these searches have already been using supervised learning to train 
classihers to discriminate between two two discrete hypotheses based on a high dimensional 
feature vector x. The results of this paper outline how to extend the use of discriminative 
classihers for composite hypotheses (parameterized by 6) in a way that hts naturally into 
the established likelihood based inference techniques. 

The rest of the paper is organized as follows. In Sec. we hrst introduce the likelihood 
ratio test statistic in the setting of simple hypothesis testing, and then outline how it can 
be computed exactly using calibrated classihers. In Sec. we generalize the proposed ap¬ 
proach to the case of composite hypothesis testing and discuss directions for approximating 
the statistic efficiently. We then illustrate the proposed method in Sec. and outline how 
it could improve statistical analysis within the held of high energy physics. Related work 
and conclusions are hnally presented in Sections and 
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2 Likelihood ratio tests 


2.1 Simple hypothesis testing 


Let X be a random vector with values x G df C and let px(x|6*) denote the density 
probability of X at value x under the parameterization 9. Let also assume i.i.d. observed 
data V = {xi,... ,x„}. In the setting where one is interested in simple hypothesis testing 
between a null 6 = 6o against an alternate 6 = 6i, the Neyman-Pearson lemma states that 
the likelihood ratio 


A(®ie„,«,) = n 

xen 


Px(x|go) 

Px(x|0i) 


( 2 . 1 ) 


is the most powerful test statistic. 

In order to evaluate X(T>), one must be able to evaluate the probability densities px(x|6*o) 
and px(x|6*i) at any value x. However, it is increasingly common in science that one has a 
complex simulation that can act as generative model for px(x|6'), but one cannot evaluate 


the density directly. For instance, this is the case in high energy physics (Neal, 2007) where 
the simulation of particle detectors can only be done in the forward mode. 


2.2 Approximating likelihood ratios with classifiers 


The main result of this paper is to generalize the observation that one can form a test 
statistic 


A'(Bi9„,9i)= 

xec 


Pvju 

Pv{u 


g(x)|go) 

s(x)|0i) 


( 2 . 2 ) 


that is strictly equivalent to 2.1, provided the change of variable U = s(X) is based on a 
(parameterized) function s that is strictly monotonic with the density ratio 


r(x;6'o,6'i) 


Px(x|go) 

Px(x|0i)’ 


(2.3) 
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As derived below, this allows to recast the original likelihood ratio test into an alternate 
form in which supervised learning can be used to build s(x) as a discriminative classiher. 
In Sec. I^we extend this result to generalized likelihood ratio tests, where it will be useful 
to have the classiher decision function s parameterized in terms of {6q,6i). 


Theorem 1. Let ^ be a random vector with values in X O ML and parameterized probability 
density px(x = (xi, ...,Xp)\6) and let s : M^ M be a function monotonic with the density 
ratio r(x; 00, ^i), for given parameters 9 q and 9i. In these conditions, 

Px(x|0o) = s(x)|0o) 


r(x;0o,^i) = / N - / ^ 

px(x|0i) Pu(m = ■s(x)|0i) 

where pxi{u = s(x; 9q, 0i)|0) is the induced probability density o/U = s(X; 9o, 9i] 


(2.4) 


Proof. Starting from the dehnition of the probability density function, we have 


Pu(m = s(x)|0o) = 

du 


I Px(x'|0o)c?x' 

{x^s(x^)<w} 

= f S{u- s(x'))px(x'|9o)4x' 


(2.5) 


Intuitively, this expression can be understood as the integral over all x' G M^ such that 


s(x') = u, as picked by the Dirac 6 function. Given Theorem 6.1.5 of Hormander (1990), 
it further comes 


Pu{u = s(x)|0o) = 


-7YtPx(x'|0o)c?5'x 


( 2 . 6 ) 


'en„ |Vs(x')r 

where flu = W ■ s(x') = u}, |Vs(x')| = \jY7i=i where dS'x' is the Euclidean 

surface measure on flu- Also, since s(x) is monotonic with r(X;0O;^i); there exists an 
invertible function m : M’*' i—)■ M such that s(x) = m{r(K] 9o, 0i)). In particular, we have 

Px(x|0o) 


Px(x|9.) 

Px(x|0o) = m"^(s(x))px(x|0i) 


(2.7) 


5 










Combining equations 2.6 and 2.7, the density ratio r(X;6'o,6'i) can be pulled out of the 
integral, resulting in 

1 


Pu{u = s(x)|6'o) = 


-^m ^(s(x'))px(x'|6'i)ciS'x 


|Vs(x')| 

^ ^ —m“^(M)px(x'|6'i)dS'x 


Jn. |Vs(x')| 

= m"^(s(x)) |vg|x/^| ^x(x |gi)d^x' 

Px(x|0o) f 1 ^ 

= —/ |V7 / A| Px X 01 d^x'. 

Px X 01 Vs x' 


( 2 . 8 ) 


Similarly, Equation 2.6 can be used to derive pv{u = s(x)|0i), hnally yielding 

Pu{u = s(x)|0o) _ px(x|0o) |v4x')|Px(x'|0i)d5'x' 

Pu{u = s(x)|0i) px(x|0i) ^^px(x'|0i)d^x' 

^ Px(x|0o) 

Px(x|0i)‘ 


(2.9) 

□ 


In light of this result, the likelihood ratio estimation problem can now be recast as a 
(probabilistic) classihcation problem, by noticing that the decision function 

Px(x|0i) 


S X = 


( 2 . 10 ) 


Px(x|0o) +Px(x|0i)‘ 

modeled by a classiher trained to distinguish samples x ~ pe^ from samples x ~ pg-^ 
satishes the conditions of Thm. (see Appendix]^. In other words, supervised learning 
yields a sufficient procedure for Thm. [T] to hold, guaranteeing that any universally strongly 
consistent algorithm can be used for learning s*. Note however, that it is not a necessary 
procedure since Thm. [T] holds for any monotonic function m of the density ratio, not only 
for m(r(x)) = (1 + r(x))~^. Equivalently, Thm. shows that in the case that we learn a 
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classifier s(x) which is imperfect up to a monotonic transformation of r(x), then one can still 
resort to calibration (i.e., modeling pu('W = s(x))) to compute r(x) exactly. For this reason, 
the proposed method is expected to be more robust than directly using (1 — s(x))/s(x) as 
an approximate of r(x) (which indeed converges towards r{x) when s{x) tends to s*(a;)). 


2.3 Learning and calibrating s 

In order for the proposed approach to be useful in the likelihood-free setting, we need to 
be able to approximate both s(x) and p(s(x)|6') based on a hnite number of samples {xj} 
drawn from the generative model p{x.\6). 

As outlined above, any consistent probabilistic classification algorithm can be used for 


learning an approximate map s(x) of Eqn. 2.10 In the common case where the density 
ratio is expected to smoothly vary around x, we would however recommend learning models 
whose output value s(x) also smoothly varies around x, such as neural networks. For small 
training sets, tree-based methods are not expected to work so well for this use case, since 
they usually model s(x) as a non-strictly monotonic composition of step functions. In such 
cases where s(x) is not monotonic with r(x), the induced probability does not factorize 


as in Eqn. |2.8[ leading to artifacts in the resulting approximation of the density ratio. 
Provided enough training data, accurate results can however still be achieved, given the 
universal approximator capacity of tree-based models. 

Given a reduction map s, our results show that a statistic equivalent to the likelihood 
ratio can be constructed, provided p(s(x)|6*) can be evaluated. Again, we do not have 
a direct and exact way for evaluating this density, but an approximation p(s(x)|6') can 
be built instead, e.g. using density estimation algorithms, such as histograms or kernel 
density estimation applied to {s(xj)}, where the {x*} are drawn from the generative model. 
Most notably, learning such an approximation of p(s(x)|0) is a much simpler problem than 
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learning p(x|0), since the reduction s projects x into a one-dimensional space in which only 
the (simpler) informative content of r(x) is preserved. 

An alternative approach for calibration is to approximate the density ratio r(s(x)) 
directly. For example, isotonic regression, which is commonly used to transform the clas- 
siher score s(x) into Siso(x) that more accurately reflect the posterior probability s*(x) 
of Eqn. 2.10[ can be used for calibration. This is done by inverting the relationship 
r(x) = (1 — s*(x))/s*(x) to obtain f(s(x)) = (1 — .Sigo(x))/siso(x). Additionally, Sec. 
describes related work in which the ratio f(x) is estimated directly on the feature space X. 

One strength of the proposed approach is that it factorizes the approximation of the di¬ 
mensionality reduction (s(x) s(x)) from the calibration procedure (p(s(x)|6') p(s(x)|6*) 

or f(s(x)) r(s(x))). Thus, even if the classiher does a poor job at learning the optimal 


decision function |2.10| and, therefore, at reproducing the level sets of the per-sample likeli¬ 
hood ratio, the density of s can still be well calibrated. In that case, one might loose power. 


but the resulting inference will still be valid. This point was made by Neal (2007) and is 
well appreciated by the particle physics community that typically takes a conservative at¬ 
titude towards the use of machine learning classihers precisely due to concerns about the 
calibration of p-values in the face of nuisance parameters associated to the simulator. 


3 Generalized likelihood ratio tests 

Thus far we have shown that the target likelihood ratio r(x;6'o,6'i) with high dimensional 
features x can be reproduced via the univariate densities p(s(x)|6*o) and p(s(x)|6*i) if the 
reduction s(x) is monotonic with r(x;0o,6'i). We now generalize from the ratio of two 
simple hypotheses specihed by and 9i to the case of composite hypothesis testing where 
6 are continuous model parameters. 
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3.1 Composite hypothesis testing 


In the case of composite hypotheses 0 G ©o against an alternative 6* G ©i (such that 
©0 n ©1 = 0 and ©o U ©i = ©), the generalized likelihood ratio test, also known as the 
prohle likelihood ratio test, is commonly used 


^ _ suPegeoP(^l^) 

^ sup,,eP(^^l^) ■ 


(3.1) 


This generalized likelihood ratio can be used both for hypothesis tests in the presence of 
nuisance parameters or to create conhdence intervals with or without nuisance parame¬ 
ters. Often, the parameter vector is broken into two components 6 = (p-,i^), where the /i 
components are considered parameters of interest while the u components are considered 
nuisance parameters. In that case ©o corresponds to all values of z/ with /i hxed. 


Evaluating the generalized likelihood ratio as dehned by Eqn. |3.1| requires Ending for 
both the numerator and the denominator the maximum likelihood estimator 


6 = argmaxp(P|6'). 
e 


(3.2) 


Again, this is made difficult in the likelihood-free setting and it is not obvious that we can 
find the same estimators if we are working instead with p(s(x)|0). Fortunately, there is a 


construction based on s that works: the maximum likelihood estimate of Eqn. |3.2| is the 
same as the value that maximizes the likelihood ratio with respect to p{V\6i), for some 
fixed value of 9i chosen such that the support of p(x|6*i) covers the support of p(x|6*). This 
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allows us to use Tlim. to reformulate the maximum likelihood estimate (MLE) 

6 = aig max p{V\6) 
e 

p{x\e) 


as 


= arg max 


xec 


= arg max 


p(x|0i) 

p(s(x;0,0i)|0) 


(3.3) 


e ^^P{s{x-0,ei)\di) ’ 
where s(x;6*,6*i) denotes a parameterized transformation s of X in terms of {0,6i) that 
is monotonic with r(x;6*,6*i). Note that it is important that we include the denominator 
p{s{x.;9,9i)\9i) because this cancels Jacobian factors that vary with 9. 

Finally, once the maximum likelihood estimates have been found for both the numerator 


and denominator of Eqn. 3.1, the generalized likelihood ratio can be estimated as outlined 


in Sec. 2.2 for simple hypothesis testing. 


3.2 Parameterized classification 

In order to provide parameter inference in the likelihood-free setting as described above, 
we must train a family s(x; 9q,9i) of classihers parameterized by 9q and 9i, the parameters 
associated to the null and alternate hypotheses, respectively. While this could be done 
independently for all 9o and 9i, using the procedure outlined in Sec. it is desirable 
and convenient to have a smooth evolution of the classihcation score as a function of the 
parameters. For this reason, we anticipate a single learning stage based on training data 
with input (x, 6*o, 6*i)j and target i/i, as outlined in Alg.[^ Somewhat unusually, the unknown 
values of the parameters are taken as input to the classiher; their values will be specihed 


via the enveloping (generalized) likelihood ratio of Eqn. 3.1 In this way, the parameterized 


classiher now models the distribution of the output y conditional to (x, for any x 

and any combination of parameter values 9q,9i. 
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Algorithm 1 Learning a parameterized classifier. 

r:={}; 

while size(T) < iV do 
Draw 9q ~ vre^; 

Draw X p(x|6*o); 
r:=ru{((x,0o,0i),l/ = O)}; 

Draw 6i ~ vrep 
Draw X p(x|6*i); 

r;=ru{((x,0o,0i),2/ = l)}; 

end while 

Learn a single classifier s(x; 6o,6i) from T. 


While the optimal decision fnnction 2.10 is expected to be learned for the parameter 
valnes 6*o and 6i selected in Alg. [TJ it is not clear whether the optimal decision fnnction can 
be expected for data generated from 6^ and 6[ never jointly enconntered dnring learning. 
Similarly, it is not clear how the limited capacity of the classifier may impact the perfor¬ 
mance of the resnlting parameterized decision fnnction. Preliminary exploration by [Baldi 


et ah (2016) shows that a nniform grid scan over parameter space is an effective practical 


approach; however, we introdnce the distribntions vr©;, and 7r©j into the Alg. to allow for 
a more sophisticated sampling strategy. 


3.3 Parameterized calibration 

Once the parameterized classifier s(x; 6o,9i) is trained, we can nse the generative model 


together with one of the calibration strategies discnssed in Sec. 2.3 for particnlar valnes of 
6o and 6i. For a single parameter point 6, this is a tractable nnivariate density estimation 
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problem. The challenge comes from the need to calibrate this density for all values of 
6. A straightforward approach would be to run the generative model on demand for any 
particular value of 6. In the context of a likelihood £t this would mean that the optimization 
algorithm that is trying to maximize the likelihood with respect to 6 needs access to the 
generative model p{x.\9). This is the strategy used for the examples presented in Sec. 

Calibrating the density on-demand can be impractical when the generative model is 
computationally expensive or has high-latency (for instance some human intervention is 
required to reconfigure the generative model). In high energy physics, where it is common 
to calibrate the distribution of a hxed classiher. There the strategy is to interpolate the 
distribution between discrete values of 6 in order to produce a continuous parameterization 
for p{s\9) (Read, 1999; Cranmer et ah, 2012; Baak et ah, 2015). One can easily imagine 
a number of approaches to parameterized calibration and the relative merits of those ap¬ 
proaches will depend critically on the dimensionality of 9 and the computational cost of 
the generative model. We leave a more general strategy for this overarching optimization 
problem as an area of future work. 


3.4 Mixture models 

In the special case of (simple or composite) hypothesis testing between models defined as 
known mixtures of several components, i.e. when ^(xl^) can be written as 

P(x|0) = ^Wc(%c(x|0), (3.4) 
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the target likelihood ratio can be formulated in terms of pairwise classihcation problems. 
Specihcally, we can write 


p(x|^o) 

p(x| 0 i) 


Y.c^c{0o)PciM0o) 

Y.C' Wc'{6i)pci{-x\6i) 

Wc'(^i)Pc'(x|6'i) 

C 

E 


E 

. c' 

E 


Wc{Qq) Pc(x|6'o) 


-1 


Wc'( 6 'i)pe'(-Sc,c'(x; 6 * 0 , 6 *i)| 6 >i) 
Wc{Qq) Pc(sc,c'(x;6'o,6'i)|6'o) 


-1 


(3.5) 


The second line is a trivial, but a useful decomposition into pairwise density ratio sub¬ 
problems between pc'{x\9i) and Pc(x| 6 'o). The third line uses Thm. to relate the high¬ 
dimensional likelihood ratio into an equivalent calibrated likelihood ratio based on the 
univariate density of the corresponding classiher. 

In applications where mixture models are commonly used, this decomposition allows 
one to construct better likelihood ratio estimates since it allows the classihers Sc,c' to focus 
on simpler sub-problems, for which higher accuracy is expected. 

Finally, as a technical point, in the situation where the only free parameters of the model 


are the mixture coefficients Wc, the distributions Pc(’Sc,c'(x; 6 *i)| 6 *) are independent of 9. 
For this reason, sub-ratios rc,c'(x; 9o, 9i) = simplify to which can 

be pre-computed without the need of parameterized classihcation or calibration. 


3.5 Diagnostics 

While Thm. [^states that the likelihood ratio r(x; ^o, 9i) is invariant under the dimension¬ 
ality reduction s(x; 9o,9i) provided that it is monotonic with r(x; ^o, ^i) itself and we know 
that any universally strongly consistent algorithm can be used to learn such a function, 
we know that in practice f(s(x; ^^i)) will not be exact. Thus, it is crucial that to have 
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some diagnostic procedures to assess the quality of this approximation. This is complicated 
by the fact that in the likelihood-free setting, we don’t have access to the true likelihood 
ratio. Below we consider two such diagnostic procedures that can be implemented in the 
likelihood-free setting. We illustrate these diagnostic procedures in Fig. 

The hrst diagnostic procedure is related to the procedure for hnding the MLE 6 in 


Eqn. |3.3[ As pointed out there it is important that one maximizes the likelihood ratio as 
the surface integral and Jacobian factors related to the dimensionality reduction only cancel 


in the ratio (see Eqn. 2.8). Importantly, they also only cancel if the reduction map satishes 


the assumptions of Thm. [Ij Moreover, the resulting value of 9 should be independent of 
the value of 6 i used in the denominator of the likelihood ratio. Similarly, we have 


log A( 6 ') = log 


P^O) 

p{v\e) 


p{V\9) p{V\9) 

p{V\9,) 


(3.6) 


for all values of 9i. Thus, by explicitly checking the independence of these quantities on 6 i 
we indirectly probe the quality of the approximation f(s(x; 9o,9i)) ~ r(s(x; 6*o, 6*i)). 

The second diagnostic procedure leverages the connection of this technique to direct 
density ratio estimation and its application to covariate shift and importance sampling. 
The idea is simple: we test the relationship ^(xl^o) = p(x| 6 'i)r(s(x; 6 *o, 6 ^ 1 )) with the ap¬ 
proximate ratio f(s(x; 9o,9i)) and samples drawn from the generative model. More specif¬ 
ically, we can train a classiher to distinguish between unweighted samples from p(xj9o) 
and samples from p{x.\9i) weighted by f(s(x; 6 ^ 0 , ^i)). If the classiher can distinguish be¬ 
tween the distributions, then f{s{x.;9o,9i)) is not a good approximation of r(s(x; 6 ^ 0 , ^i)). 
In contrast, if the classiher is unable to distinguish between the two distributions, then 
either f (s(x; 60 , 61 )) is a good approximation or the discriminator is not ehective. The two 
situations can be disentangled to some degree by training another classiher to distinguish 
between an unweighted distribution of samples from p(x| 6 *i). 
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4 Examples and applications 


In this section, we illustrate the proposed method on two representative examples where 
the exact likelihood is known and then discuss its application to high energy physics. The 
code used to produce the results and extended details for these examples is available in 
Ref. (Louppe et ah ,[2016 ), which utilizes the classihcation and calibration routines in scikit- 


learn (Pedregosa et ah, 2011). 


4.1 Likelihood ratios of mixtures of normals 

As a simple and illustrative example, let us first consider the approximation of the log- 
likelihood ratio log (r(x; 7 = 0.05 ,7 = 0)) between the ID mixtures p(x |7 = 0.05) and 
p(x |7 = 0 ) defined as 

p(x| 7 ) = (1 (4.1) 

where Pco •= -^(/^ = ~2, = 0.25^), Pci := ■A/'(/i = 0, cr^ = 4), Pc^ ■= = 1, cr^ = 0.25). 

Samples drawn for the nominal value 7 = 0.05 are shown in Figure [T^ and used later for 
inference. 


Figure lb shows the intermediate stages for the decomposition described in Sec. 3.4 


The blue and green curves show pc'(sc,c'(^))) fh® distributions for the score for the sub- 
classihers for the three pair-wise comparisons of the mixture components. The red curves 


in Fig. lb show the approximation of the density ratio (rescaled as (1 -|- r(x)) ^) obtained 
from those distributions. 


Figures Ic and Id show the approximate logf(x) as a function of x using a 2-layer 


neural network and a random forest for the classifier s(x). The neural network provides 
a smoother approximation, while the random has some artifacts due to the fact that the 
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decision function is piece-wise constant. The blue curves show the exact logr(x), the green 
curves show log((l — s(x))/s(x)) without calibration, while the red curve is the improved 
approximation logf(x) calibrated using histograms. Finally, the cyan curve shows the 


approximated log-likelihood ratio when decomposing the mixture, as seen in Fig. lb By 
leveraging the fact that densities are mixtures, the capacity of the underlying classifiers 
can be more effectively focused on easier classification tasks, resulting as expected in even 
more accurate approximations. 

As the results show, calibrating s(x) through univariate density estimation of p(s(x)) 
is key to obtaining accurate results. Standard histograms with uniform binning have been 
used here for illustrative purposes, but we anticipate that more sophisticated calibration 
strategies will be important in further development of this method. We leave this as an 
area for future study. 


Figure le shows the distribution of logf(x) for 7 = 0.05 using the decomposed ap¬ 
proximation of f(x) with neural networks and the distribution of the exact log-likelihood 
ratio. While there are some artifacts in the distribution in the low-probability regions and 
the maximum value of the log-likelihood ratio is underestimated, the overall shape of the 
distribution is well approximated. 

Finally, we come to the log-likelihood curve 


log A( 7 ) = log 


p(P|7) 


sup^e0P(^l7) 


(4.2) 


for the dataset T) shown in Fig. By exploiting Eqn. |3.6| , the generalized likelihood ratio 
can be computed by evaluating both terms with respect to a common reference 7 = 0 as 
outlined in Sec.[^ Figure pT| shows that the exact likelihood curve is very well approximated 
by the method, conhrming that even when the raw classiher does a poor job at modeling 
the s*(x), a good approximations of the likelihood ratio can still be obtained by calibrating 
s(x) (and by decomposing the mixture, if possible). 
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(c) logr(s(x)) using neural network (d) logr(s(x)) using random forest 




(e) |?(logf(s(x)) I 7 = 0.05) (f) —21ogA(7) 


Figure 1; Histogram of T> generated from 7 = 0.05 and plots illustrating various stages 
in the approximation of the log-likelihood ratio logr(x ;7 = 0.05,7 = 0) with calibrated 
classihers (see text). 
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(a) Exact vs. approximated MLEs. (b) p(— 21 ogA (7 = 0.05) | 7 = 0.05) 


Figure 2: Using approximated likelihood ratios for parameter inference yields an unbi¬ 
ased maximum likelihood estimator 7 , as empirically estimated from an ensemble of 1000 
artihcial datasets. 


An advantage of this approach compared to Approximate Bayesian Computation (Beau¬ 


mont et al., 2002 ) is that the classiher and calibration - computationally intensive parts of 


the approximation - are independent of the dataset V. Thus once trained and calibrated, 
the approximation can be applied to any dataset V. This makes it computationally efficient 
to perform ensemble tests of the method. 

Figure shows the empirical distribution of the maximum likelihood estimators (MLEs) 
from the approximate likelihood compared to the distribution of the MLEs from the exact 
likelihood. It clearly demonstrates that in this case the approximate likelihood yields an 
unbiased estimator with essentially the same variance as the exact MLE. In addition to 
the MLE, we can study the coverage of a conhdence interval based on the likelihood ra¬ 
tio test statistic. This is done by evaluating — 21 ogA (7 = 0.05) for samples drawn from 
p(x |7 = 0.05). Wilks’s theorem states that the distribution of — 21 ogA (7 = 0.05) should 
follow a Xi distribution. Figure]^ also conhrms this behavior, supporting the applicability 
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of this method for likelihood-based inference techniques in the likelihood-free setting. 


4.2 Parameterized inference from multidimensional data 


Let us now consider the more challenging problem of likelihood-free inference with multi¬ 
dimensional data. For the sake of the illustration, we will assume 5-dimensional feature x 
generated from the following process po\ 


1. z ;= (^ 0 , ^ 1 , ^ 2 , ^ 3 , ^ 4 ), such that Zq ~ A/'(/r = a, a = 1), Zi ~ A/'(/i = (3, a = 3), 
Z 2 ~ Mixture(Y 2 A^(/i = —2, a = 1), 1/2 A^(/i = 2, a = 0.5)), z^ ~ Exponential(A = 3), 
and Z 4 ~ Exponential A = 0.5); 

2. X := i?z, where R is a. hxed semi-positive dehnite 5x5 matrix dehning a hxed 
projection of z into the observed space. 


The observations V represented in Fig. |^are random samples with a = 1 and (3 = —1 
Our goal is to infer the values a and [3 based on V. We construct the log-likelihood ratio 

p{V\a,(3) 


- 21 ogA(a,/?) = -2 log 


(4.3) 


sup„,/3P(T>|a,/3) 

that we calculate via Eqn. 3.6[ Following the procedure described in Sec. 3^ we build a 
single 2-layer neural network (with 5-1-2 inputs and one output node) to form the parame¬ 
terized classiher s(x; 6 * 0 , Oi) and hx 6*1 = (a = 0, /3 = 0). Since the generative model is not 
expensive, the classiher output is calibrated on-the-hy with histograms for every candidate 
parameter pair {a,/3). 

Figure [4^ shows the exact log-likelihood ratio for this dataset, which has an exact MLE 
at {a = 1.012,/3 = —0.9221). Figure 4b shows the approximate log-likelihood ratio eval¬ 


uated on a coarse grid of parameter values. Some roughness in the contours is observed, 
which is primarily due to variance introduced in the calibration procedure. In addition 
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Figure 4: Inference from exact and approximate likelihood ratios. The red dot corresponds 
to the true values (a = l,/3 = —1) used to generate V, the green dot is the MLE from 
the exact likelihood, while the blue dot is the MLE from the approximate likelihood. 1, 


2 and S-a contours are shown in white. (4a) The exact —21ogA(a,/9) for the observed 


data V. (4b) The approximate —21ogA(a,/3) evaluated on a coarse 15 x 15 grid. (4c) A 
Gaussian Process surrogate of —2 log A(a, /d) ratio estimated from a Bayesian optimization 
procedure. White dots show the parameter points sampled during the optimization process. 
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to the statistical fluctuations due to flnite calibration samples, there are also fluctuations 
introduced from changes in the binning of the calibration histograms as a and f3 vary. As 


discussed in Sec. |3.3[ a parameterized calibration procedure should ameliorate this issue, 
but that is left for now as an area for future work. Nevertheless, optimizing the approxi¬ 


mate log-likelihood ratio with a Bayesian optimization (Brochu et al., 2010 GPyOpt, 2015) 
procedure is efficient and effective. After 50 likelihood evaluations, the maximum likelihood 
estimate is found at (d = 1.008,/3 = —1.004). While the objective of the Bayesian opti¬ 
mization procedure is to And the maximum likelihood, the posterior mean of the internal 


Gaussian process, shown in Fig. is close to the exact log-likelihood ratio illustrated in 
Fig. 1^ In each case, the true values a = 1 and (3 = —1 are contained within the 1 — a 
likelihood contour. 


Finally, we evaluate the diagnostics described in Sec. |3.5| for this example. To aid in 
visualization, we restrict to a 1-dimensional slice of the likelihood along a with /3 = —1. We 
consider three situations: i) a poorly trained, but well calibrated classifier; ii) a well trained, 
but poorly calibrated classifier; and iii) a well trained, and well calibrated classifier. For 
each case, we employ two diagnostic tests. The first checks for independence of —2 log A(6*) 


with respect to changes in the reference value 9i as shown in Eqn. |3.6[ The second uses a 
classifier to distinguish between samples from p(x|6*o) and samples from ^(xl^i) weighted 


according to r(x; ^q, ^i). As discussed for Fig. 4b, statistical fluctuations in the calibration 
lead to some noise in the raw approximate likelihood. Thus, we show the posterior mean of a 
Gaussian processes resulting from Bayesian optimization of the raw approximate likelihood 


as in Fig. 5c In addition, the standard deviation of the Gaussian process is shown for one 
of the 6 i reference points to indicate the size of these statistical fluctuations. It is clear that 
in the well calibrated cases that these fluctuations are small, while in the poorly calibrated 
case these fluctuations are large. Moreover, in Fig. I^we see that in the poorly trained. 
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(a) Poorly trained, well calibrated. 


(b) Poorly trained, well calibrated. 




(c) Poorly calibrated, well trained. 


(d) Poorly calibrated, well trained. 




(e) Well trained, well calibrated. 


(f) Well trained, well calibrated. 


Figure 5: Results from the diagnostics described in Sec. 3.5 The rows correspond to the 


quality of the training and calibration of the classiher. The left plots probe the sensitivity 
to 6 i, while the right plots show the ROC curve for a calibrator trained to discriminate 
samples from p(x|6*o) and samples from p(x|6*i) weighted as indicated in the legend. 
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well calibrated case the classifier s(x; 9o,9i) has a significant dependence on the 9i reference 


point. In contrast, in Fig. 5c the likelihood curves vary significantly, but this is comparable 


to the fluctuations expected from the calibration procedure. Finally, Fig. [5e] shows that in 
the well trained, well calibrated case that the likelihood curves are all consistent with the 
exact likelihood within the estimated uncertainty band of the Gaussian process. The ROC 
curves tell a similarly revealing story. As expected, the classifier is not able to distinguish 
between the distributions when p(x|6*i) is weighted by the exact likelihood ratio. We can 
also rule out that this is a deficiency in the classifier because the two distributions are well 
separated when no weights are applied to p(x|6'i). In both Fig. 5b and Fig. 5d the ROC 
curve correctly diagnoses deficiencies in the approximate likelihood ratio f(s(x; ^i))- 
Finally, Fig. |5f| shows that the ROC curve in the well trained, well calibrated case is almost 
identical with the exact likelihood ratio, confirming the quality of the approximation. 

Overall, this example further illustrates and confirms the ability of the proposed method 
for inference with multiple parameters and multi-dimensional data where reliable approxi¬ 
mations p(x|6*o) and ^(xl^i) are often difficult to construct. 


4.3 High energy physics 

High energy physics was the original scientific domain that motivated the development 
of this procedure. In high energy physics, we are often searching for some class of events, 
generically referred to as signal, in the presence of a separate class of background events. For 
each event we measure some quantities x, with corresponding distributions ps(x|i/) for signal 
and pb(x|z/) for background, where u are nuisance parameters describing uncertainties in 
the underlying physics prediction or response of the measurement device. The total model 
is a mixture of the signal and background, and /i is the mixture coefficient associated to 
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the signal component, that is 


p(I>|fi, 1^) = JJ lw.(x|i^) + (1 - p)ps(x|i/)] . 


(4.4) 


xex> 


Accordingly, new particle searches are typically framed as hypothesis tests where the nnll 
corresponds to p = 0, and the generalized likelihood ratio is used as a test statistic. 

Nuisance parameters are an after thought in the typical usage of machine learning in 
high energy physics. The classihers are typically trained with data generated using a hxed 
nominal value of the nuisance parameters v = vq. However, as experimentalists we know 
that we must account for the systematic uncertainties that correspond to the nuisance 
parameters v. Thus, typically we take the classiher s(x) as hxed and then propagate 
uncertainty by estimating ps(s(x)|z/) with a parameterized calibration procedure. However, 
this classiher is clearly not optimal for u ^ uq. In contrast, a parameterized classiher 
proposed in this work would yield more accurate estimates of the generalized likelihood 
ratio. 

In addition to robustness to systematic uncertainties incorporated by the nuisance pa¬ 
rameters u, the proposed method can be used to infer parameters of interest. Not only can 
the mixture coefficient p be inferred using the decomposition procedure, but also physical 
parameters like particle masses that change the distribution of x. This formalism represents 
a signihcant step forward in the usage of machine learning in high energy physics, where 
classihers have always been used between two static classes of events and not parameterized 
explicitly in terms of the physical quantities we wish to measure. 

Another approach for parameter inference with multi-dimensional data specihc to high 
energy physics is the so-called matrix element method, in which one directly computes an 
approximate likelihood ratio by performing a computationally intensive integral associated 


to a simplihed detector response (Volobouev, 2011). In the approach considered in this 
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paper, the detailed detector response is naturally incorporated by the simulator; however, 
that integral is intractable for the matrix element method. Even with drastic simplihcations 
of the detector response, the matrix element method can take several minutes of CPU time 
to calculate the likelihood ratio for a single event x. The work here can be seen as aiming at 
the same conceptual target, but relying on machine learning to overcome the complexity of 
the detector simulation. It also offers enormous speed increase for evaluating the likelihood 
at the cost of an initial training stage. In practice, the matrix element method has only 
been used for searches and measurement of a single physical parameter (sometimes with a 


single nuisance parameter as in (Aaltonen et ah, 2010)). 


Contemporary examples where the technique presented here could have major impact 
include the measurement of coefficients to quantum mechanical operators describing the 


production and decay of the Higgs boson (Chen et ah, 2015) and, if we are so lucky. 


measurement of the mass of supersymmetric particles in cascade decays (Allanach et ah 


2000). Both of these examples involve data sets with many events, each with a feature 


vector X that has on the order of 10 components, and a parameter vector 6 with 2-10 
parameters of interest and possibly many more nuisance parameters. 

5 Related work 


The closest work to the proposed method is due to Neal (2007), who similarly considers the 


problem of approximating the likelihood function when only a generative model is available. 
That work sketches a scheme in which one uses a classiher with both x and 6 as an input 
to serve as a dimensionality reduction map. The key distinction comes in the handling of 
6 . Neal argues that a classiher cannot be used on real data, since we do not know the 
correct value for 6 , and goes on to outline an approach where one uses regression on a 
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per-event basis to estimate d(x) and perform the composition s(x;6'(x)). As pointed out 
by the author, this can lead to a signihcant loss of information since a single observation 
X may carry little information about the true value of 0, though a full data set V may 


be informative. The work of Neal (2007) correctly identihes this as an approximation 
of the target likelihood even in the case of a ideal classiher. In contrast, the approach 
described here does not eliminate the dependence of the classiher on 0. Instead, we embed 
a parameterized classiher into the likelihood and postpone the evaluation of the classiher 
to the point of evaluation of the likelihood when 0 is explicitly being tested. This avoids 
the loss of information that occurs from the regression step 0(x) proposed by 


Neal 


( |2007D 

and leads to Thm. [T| which is an exact result in the case of an ideal classiher. In both 
cases, the quality of the classiher is factorized from the calibration of its density, which 
allows for valid inference even if there is a loss of power due to a non ideal classiher. 


Also close to our work, Scott and Nowak (2005) and Xin Tong (2013) consider the 
machine learning problem associated to Neyman-Pearson hypothesis testing. In a simi¬ 
lar setup, they consider the situation where one does not have access to the underlying 
distributions, but only has i.i.d. samples from each hypothesis. This work generalizes 
that goal from the Neyman-Pearson setting to generalized likelihood ratio tests and em¬ 


phasizes the connection with classihcation. Drier et ah (2004) take on a diherent problem 
(tests of statistical independence) by using machine learning algorithms to hnd scalar maps 
from the high-dimensional feature space that achieve the desired statistical goal when the 
fundamental high-dimensional test is intractable. 

More generally, likelihood ratio testing directly relates to the density ratio estimation 
problem, which consists in estimating the ratio of two densities from hnite collections of 
observations Vq and Vi. Density ratio estimation is connected to many machine learning 


fundamental problems, including transfer learning (Sugiyama and Kawanabe, 2012), prob- 
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abilistic classification and regression (Vapnik, 1998), outlier detection (Hido et al., 2011), 


and many others. For learning under covariate shift, Shimodaira (2000) and Sugiyama and 


Muller] (2005) estimate the density ratio r(x;9o,9i) from straightforward approximations 


p(x|6*o) and p(x|6*i) separately obtained using kernel density estimation. Despite its theo¬ 


retical consistency, this approach is known to be ineffective in practice (Sugiyama et al. 


2007 Bickel et ah, 2009), since it relies on modeling numerator and denominator high¬ 


dimensional densities, which is a harder problem than modeling their ratio only. While the 
proposed method also proceeds in two similar steps, estimating p(s(x)) is much easier than 
estimating p(x), since s projects x into a one-dimensional space in which only the infor¬ 
mative content of r(x) is preserved. Finally, in contrast with the proposed method which 
decouples reduction from calibration, other approaches proposed within the literature (see 


Sugiyama et al. (2012); Gretton et al. (2009); Nguyen et al. (2010); Vapnik et al. (2013) 


and references therein) provide solutions for estimating r(x;0o,^i) directly from x, in one 
step. Under some assumptions, the convergence of the obtained estimates is also proven 
for some of these approaches. 


6 Conclusions 

In this work, we have outlined an approach to reformulate generalized likelihood ratio tests 
with a high-dimensional data set in terms of univariate densities of a classiher score. We 
have shown that a parameterized family of discriminative classihers s(x; 9o,9i) trained and 
calibrated with a simulator can be used to approximate the likelihood ratio, even when it is 
not possible to directly evaluate the likelihood ^(xl^). The proposed method offers an alter¬ 
native to Approximate Bayesian Computation for parameter inference in the likelihood-free 
setting that can also be used in the frequentist formalism without specifying a prior over 
































the parameters. In contrast to approaches that learn the posterior conditional on V, our 
approach can be applied to any observed data V once trained. A strength of this approach 
is that it separates the quality of the approximation of the target likelihood from the qual¬ 
ity of the calibration. The former leverages the continuing advances in supervised learning 
approaches to classihcation. The calibration procedure for a particular parameter point is 
fairly straightforward since it involves estimating a univariate density using a generative 
model of the data. The difficulty of the calibration stage is performing this calibration 
continuously in 6. Different strategies to this calibration are anticipated depending on the 
dimensionality of 6, the complexity of the resulting likelihood function, or the practical 
issues associated to running the simulator. 
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A Probabilistic classification for building s 

In this appendix, we show for completeness that the probabilistic classification framework 
yields a reduction s which satisfies conditions of Thm. 

Proposition 2. Let X = {Xi,Xp) and Y be random input and output variables with 
values in X CRP and y = {0,1} and mixed joint probability density function pxxix^y)- 
For the squared error loss, the best regression function s : A i—>■ [0,1], or equivalently the 
best probabilistic classifier, is 

P{Y = l)px|y(x|F = 1) 


S X = 


PiY = 0)px|y(x|F = 0) + P(F = l)px|y (x|F = 1)' 
Proof. For the squared error loss, 


(A.l) 


s*(x) = argminEy|x=x{(>^ - s(x))^} 

s(x) 

= argminEy|x=x{h'^} - 2s(x)Ey|x=x{>^} + s(x)^ 

s(x) 

= argmin-2s(x)Ey|x=x{h"} + s(x)^ 

s(x) 


(A.2) 


The last expression is minimized when 2s(x)Ey|x=x{d^} + 'S(x)^) = 0, that is when 

—2Ey|x=x{b^} + 2s(x) = 0, hence 


(X) = Ey|x=x{>^}. 


(A.3) 
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For y = {0,1}, 


Ey|x=x{l"} = PiY = 0|X = x) X 0 + PiY = 1|X = x) X 1 
P{Y = l)px|y(x|F = 1) 


Px(x) 


p(Y = iK|Y(x|y = 1) 


PiY = 0)px|y (x|F = 0) + PiY = lK|y (x|F = 1)' 


(A.4) 

□ 


For PiY = 0) = PiY = 1) = 1 / 2 , the best regression function s* simplifies to 

px|y(x|F = 1) 


s X = 


(A.5) 


Px|y(x|F = 0) +px|y(x|F = 1)‘ 

If we further assume that samples for y = 0 (resp. y = 1) are drawn from some pa¬ 
rameterized distribution with probability density px(x|6'o) (resp. px(x|6*i)), then the best 
regression function can be rewritten as 

Px(x|0i) 


s X = 


(A.6) 


Px(x|6'o) +px(x|6'i)‘ 

In particular, this regression function satishes conditions of Thm.[^since s*(x) = m(r(x; ^i)), 
for m(r(x)) = (1 -|- r(x))“^, is monotonic with r(x; 60, 61). 

Proposition]^ holds for the squared error loss, but it can be similarly shown that classi- 
hers minimizing the exponential loss, the binomial log-likelihood (or cross-entropy) or the 
squared hinge loss are also monotonic with the density ratio (Friedman et al., 2000; Lin| 


2002). However, a classiher with discrete outputs and minimizing the zero-one loss does 


not satisfy conditions of the theorem. 
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